Controllable Kevin Spacey Model


Figure 1. Model of Kevin Spacey (bottom), derived from Internet Photos, is controlled by his own photos or videos of other celebrities 
(top). The Kevin Spacey model captures his personality and behavior, while mimicking the pose and expression of the controllers. 


Abstract 

We reconstruct a controllable model of a person from a large photo collection that captures his or her persona, i.e., physical appearance and behavior. The ability to operate on unstructured photo collections enables modeling a huge number of people, including celebrities and other well-photographed people, without requiring them to be scanned. Moreover, we show the ability to drive or puppeteer the captured person B using any other video of a different person A. In this scenario, B acts out the role of person A, but retains his/her own personality and character. Our system is based on a novel combination of 3D face reconstruction, tracking, alignment, and multi-texture modeling, applied to the puppeteering problem. We demonstrate convincing results on a large variety of celebrities derived from Internet imagery and video.


1. Introduction 

Kevin Spacey has appeared in many acting roles over the years. He's played characters with a wide variety of temperaments and personalities. Yet, we always recognize him as Kevin Spacey. Why? Is it his shape? His appearance? The way he moves?

Inspired by Doersch et al.'s "What Makes Paris Look Like Paris" [10], which sought to capture the essence of a city, we seek to capture an actor's persona. But what defines a persona, how can we capture it, and how will we know if we've succeeded?

Conceptually, we want to capture how a person appears in all possible scenarios. In the case of famous actors, there's a wealth of such data available, in the form of photographic and video (film and interview) footage. If, by using this data, we could somehow synthesize footage of Kevin Spacey in any number and variety of new roles, and they all looked just like Kevin Spacey, then we would have arguably succeeded in capturing his persona.

Rather than creating new roles from scratch, which presents all sorts of challenges unrelated to computer vision, we will assume that we have video footage of one person (actor A), and we wish to replace him with actor B, performing the same role. More specifically, we define the following problem:

Input: 1) all available photos and/or videos of actor B, and 
2) a photo collection and a single video V of actor A 

Output: a video V' of actor B performing the same role 
as actor A in V, but with B’s personality and character. 

Figure 1 presents example results with Kevin Spacey as actor B, and two other celebrities (Daniel Craig and George Bush) as actor A.

The problem of using one face to drive another is a form of puppetry, which has been explored in the graphics literature, e.g., [20, 25, 18]. The term avatar is also sometimes used to denote this concept of a puppet. What makes our work unique is that we derive the puppet (actor B) automatically from large photo collections.

Our answer to the question of what makes Kevin Spacey look like Kevin Spacey is in the form of a demonstration, i.e., a system that is capable of very convincing renderings of one actor believably mimicking the behavior of another. Making this work well is challenging, as we need to determine which aspects are preserved from actor A's performance and which from actor B's personality. For example, if actor A smiles, should actor B smile in the exact same manner? Or use actor B's own particular brand of smile? After a great deal of experimentation, we obtained surprisingly convincing results using the following simple recipe: use actor B's shape, B's texture, and A's motion (adjusted for the geometry of B's face). Both the shape and texture models are derived from large photo collections of B, and A's motion is estimated using a 3D optical flow technique.

We emphasize that the novelty of our approach is in our application and system design, not in the individual technical ingredients. Indeed, there is a large literature on face reconstruction [9, 2, 14, 21] and on tracking and alignment techniques, e.g., [24, 15]. Without a doubt, the quality of our results is due in large part to the strength of these underlying algorithms from prior work. Nevertheless, the system itself, though seemingly simple in retrospect, is the result of many design iterations, and shows results that no prior work is capable of. Indeed, this is the first system capable of building puppets automatically from large photo collections and driving them from videos of other people.

2. Related Work 

Creating a realistic controllable model of a person's face is an extremely challenging problem due to the high degree of variability in human face shape and appearance. Moreover, the shape and texture are highly coupled: when a person smiles, the 3D mouth and eye shape changes, and wrinkles and creases appear and disappear, which changes the texture of the face.

Most research on avatars focuses on animated, non-human faces [18, 24]. The canonical example is a person driving an animated character, e.g., a dog, with his/her face. The driver's face can be captured by a webcam or a structured light device such as Kinect; the facial expressions are then transferred to a blend shape model that connects the driver and the puppet, and the coefficients of the blend shape model are applied to the puppet to create a similar facial expression. Recent techniques can operate in real time, with a number of commercial systems now available, e.g., faceshift.com (based on [24]) and Adobe Project Animal [1].

The blend shape model that is used for non-human avatars typically captures only large-scale expression deformations. Capturing fine details remains an open challenge. Some authors have explored alternatives to blend shape models for non-human characters by learning shape transfer functions [29], and by dividing the shape transfer into several layers of detail [27, 18].

Creating a model of a real person, however, is an even more challenging task due to the extreme detail that is required for a realistic face. One way of capturing fine details is to have the person participate in a sequence of lab sessions and use multiple synchronized and calibrated lights and camera rigs [2]. For example, light stages were used in the making of the Benjamin Button movie to create an avatar of Brad Pitt at an older age [9]. For this, Brad Pitt had to participate in numerous sessions capturing every possible expression his face can make according to the Facial Action Coding System [11]. These expressions were later used to create a personalized blend shape model and transferred to an artist-created sculpture of an older version of him. This approach produces amazing results, but it requires active participation of the person and takes months to execute.

Automatic methods for expression transfer between people have used multilinear models created from 3D scans [22] or structured light data [6], and transferred differences in expressions of the driver's mesh to the puppet's mesh through direct deformation transfer, e.g., [20, 25, 18], through coefficients that represent different face shapes [24, 22], or driven by speech [7]. These approaches either account only for large-scale deformations or do not handle texture changes on the puppet.

This paper is about creating expression transfer in 3D with high-detail models while accounting for texture changes that occur due to expression change. Change in texture was previously considered by [19] via image-based wrinkle transfer using ratio images, and in work on editing facial expression from only a single photo [28], face swapping [3, 8], reenactment [12], and age progression [17]. These approaches changed a person's appearance by transferring changes in texture from another person, and typically focus on a small range of expressions. Finally, [13] showed that it is possible to create a puppetry effect by simply comparing two YouTube videos (of the driver and puppet) and finding similar-looking (based on metrics of [16]) pairs of photos. However, the results simply recalled the best matching frame at each time instant, and did not synthesize continuous motion. In this paper, we show that it is possible to leverage a completely unconstrained photo collection of the person (e.g., Internet photos) in a simple but highly effective way to create texture changes, applied in 3D.

Figure 2. One goal of our system is to create a realistic puppet of any person which can be controlled by a photo or a video sequence. Both the driver and puppet only require a 2D photo collection. To produce the final textured model, we deform the average 3D shape of the puppet, reconstructed from its own photo collection, to the target expression by transferring the deformation from the driver. The texture of the final model is created separately for each frame via our texture synthesis process, which produces detailed, consistent, and expression-dependent textures.


3. Overview 

Given a photo collection of the driver and the puppet, our system (illustrated in Figure 2) first reconstructs a rigid 3D average model of the driver and the puppet. Next, given a video of the driver, it estimates 3D flow from each video frame to the driver's average model. This flow is then transferred onto the average model of the puppet, creating a sequence of shapes that move like the driver (Section 4). In the next stage, a high-detail, consistent texture is generated for each frame that accounts for changes in facial expressions (Section 5).

4. 3D Dynamic Mesh Creation 

By searching for "Kevin Spacey" on Google's image search we get a large collection of photos that are captured under various poses, expressions, and lightings. In this section, we describe how we estimate an average 3D model of the driver and the puppet, and deform it according to a video or a sequence of photos of the driver. Figure 3 illustrates the shape creation process.


3D Average Model Estimation. We begin by detecting the face and fiducial points (corners of eyes, mouth, nose) in each photo using CMU's face tracker IntraFace [26]. We next align all the faces to a canonical coordinate frame and reconstruct an average rigid 3D shape of Spacey's face. For 3D average shape reconstruction we follow [14], with the modification of non-rigidly aligning photos prior to 3D reconstruction. We describe the non-rigid alignment step in Section 5. The same reconstruction pipeline is applied to the driver and the puppet photo collections, resulting in two average 3D rigid models.

Dynamic 3D Model. Next, we create a dynamic model of the puppet that is deformed according to the driver's non-rigid motions. For the driver, we are given a video or sequence of photos. The first step is to reconstruct the 3D flow that deforms the driver's 3D average model to the expression of the driver in every single frame of the input video, using the method of [21]. The geometric transformation is given as a 3D translation field $T : \mathbb{R}^3 \to \mathbb{R}^3$ applied to the driver's average shape.

Given a reconstructed mesh at frame $i$ of a driver, $M_D^i(u,v) : \mathbb{R}^2 \to \mathbb{R}^3$, parametrized on an image plane $(u, v)$ from a depth map, and the average mesh over the entire frame sequence $M_D$, the goal is to transfer the translation field $M_D^i - M_D$ to the puppet's base mesh $M_P$ to produce $M_P^i$. To transfer the deformation, we first establish correspondence between $M_D$ and $M_P$ through a 2D optical flow algorithm between the puppet's and driver's 2D averages from their photo collections.

[15] has shown that we can obtain correspondence between two very different people by projecting one average onto the appearance subspace of the other, thereby matching illumination, and then running an optical flow algorithm between the resulting projections. With this flow, we can apply the deformation of the driver to the same facial features of the puppet. However, the direct deformation from the driver may not be suitable for the puppet; for example, if their eye sizes are different, the deformation needed to blink will be different. We solve this by scaling the magnitude of the deformation to fit each puppet as follows (Figure 3). Let the deformation vector from the driver at vertex $M_D(u,v)$ be $\Delta(u,v)$. We first find the nearest vertex to $M_D(u,v) + \Delta(u,v)$ in Euclidean distance on the driver mesh, denoted by $M_D(s,t)$. Through the flow between $M_D$ and $M_P$ computed earlier, we can establish a corresponding pair $(M_P(u',v'), M_P(s',t'))$ on the puppet mesh. The magnitude-adjusted deformation at $M_P(u',v')$ is then computed as $\hat{\Delta}(u,v)\,(\hat{\Delta}(u,v) \cdot \Delta')$, where $\hat{\Delta} = \Delta / \|\Delta\|$ and $\Delta' = M_P(s',t') - M_P(u',v')$. In addition, since the flow between the driver and puppet can be noisy around ambiguous, untextured regions, we perform standard denoising on the term $f(u,v) = \hat{\Delta}(u,v) \cdot \Delta'$ to obtain a regularized field $f^*(u,v)$. The particular denoising algorithm we use is ROF denoising with the Huber norm and TV regularization. The final puppet mesh is constructed as $M_P^i(u,v) = M_P(u,v) + \hat{\Delta}(u,v)\, f^*(u,v)$.
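
The magnitude adjustment described above can be illustrated for a single vertex. This is a minimal sketch, not the authors' implementation: the function name `adjust_deformation` and its arguments are hypothetical, the nearest-vertex search and optical-flow correspondence are assumed to have been done already, and the ROF denoising of the scalar field is omitted.

```python
import numpy as np

def adjust_deformation(delta, puppet_start, puppet_end, eps=1e-8):
    """Rescale one driver displacement so that it fits the puppet's geometry.

    delta        : driver displacement at a vertex, Delta(u, v)
    puppet_start : corresponding puppet vertex, M_P(u', v')
    puppet_end   : puppet vertex M_P(s', t') matching the driver's target vertex
    """
    delta = np.asarray(delta, dtype=float)
    norm = np.linalg.norm(delta)
    if norm < eps:
        # A vertex that does not move on the driver stays put on the puppet.
        return np.zeros_like(delta)
    direction = delta / norm  # unit direction of the driver's motion
    delta_prime = np.asarray(puppet_end, float) - np.asarray(puppet_start, float)
    f = float(direction @ delta_prime)  # scalar magnitude f(u, v), before denoising
    return direction * f

# The driver's eyelid moves down by 2 units, but the puppet's eye is 3 units tall.
adjusted = adjust_deformation([0, -2, 0], [0, 1, 0], [0, -2, 0])
```

In this toy blink the driver's direction is kept while the magnitude is taken from the puppet's own geometry, so the puppet's larger eye still closes fully.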

5. High detail Dynamic Texture Map Creation 

In the previous section, we have described how to create a dynamic mesh of the puppet. This section focuses on the creation of a dynamic texture. The ultimate set of texture maps should be consistent over time (no flickering or color change), have the facial details of the puppet, and change according to the driver's expression, i.e., when the driver is laughing, creases around the mouth and eye wrinkles may appear on the face. For the latter it is particularly important to account for the puppet's identity: some people may have wrinkles while others won't. Thus, a naive solution of copying the expression detail from the driver's face will generally not look realistic. Instead, we leverage a large unconstrained photo collection of the puppet's face. The key intuition is that to create a texture map of a smile, we can find many more smiles of the person in the collection. While these smiles are captured under different pose, lighting, white balance, etc., they have a common high detail that can be transferred to a new texture map.

Figure 3. Magnitude adjustment in deformation transfer. Take the example of a blinking eye, and denote by $M_D(u,v)$ a vertex on the driver's upper eyelid. The vertex moves down by $\Delta(u,v)$ toward $M_D(s,t)$ in order to blink. Denote the corresponding vertex on the puppet mesh by $M_P(u',v')$. Our goal is to apply $\Delta(u,v)$ to $M_P(u',v')$; however, the puppet's eyes may be bigger, so we adjust the deformation and instead use $\hat{\Delta}(u,v)\,(\hat{\Delta}(u,v) \cdot \Delta')$.

Figure 6. A visualization of the results after each step of the texture synthesis process to generate an average face of Tom Hanks. a) shows an average after all photos in the collection are frontalized by a 3D face template, b) after TPS warping, c) after dense warping, and d) the final texture after the multi-scale weighted average, which enhances facial details.

Our method works as follows. Given the target expression, which is specified either by the configuration of fiducials on the driver's face (representing, e.g., the rough shape of a smile) or by a reference photo if the driver is the same person as the puppet, we first warp all the photos in the puppet's collection to the given expression. We then create a multi-scale weighted average that preserves a uniform illumination, represented by the lower frequency bands, and enhances details in the higher frequency bands. Next we explain each of these steps in more detail.

Figure 4. A diagram for synthesizing a texture for a given reference photo, shown on the left. Each photo is non-rigidly aligned to the reference and decomposed into a Laplacian pyramid. The final output, shown on the right, is produced by computing a weighted average pyramid of all the pyramids and collapsing it. (Panels: Non-rigidly Aligned to Reference; Laplacian Pyramid Decomposition; Weight Pyramid.)

Figure 5. a) A comparison between our method (column v) and 3 baseline methods (columns ii-iv) to produce a texture that matches the target expressions given in column i. Baseline results in column (ii) are produced by warping a single average texture to the target and lack details such as creases around the mouth when the subject is smiling in the second row. Baseline results in column (iii) are produced by taking a weighted average of the photo collection with the same weights used in our method (Eq. 3). Facial features such as the mouth appear blurry and the colors of the faces appear inconsistent. Baseline results in column (iv) are produced similarly to column (iii), but each photo is warped using thin-plate spline and dense warping to the reference before taking the average. The textures appear sharper but still have inconsistent colors. Our method, in column (v) and image b), produces consistent, sharp textures with expression-dependent details.

Non-rigid warping of the photos. Each photo in the puppet's photo collection has 49 detected fiducial points. We frontalize the face by marking the same fiducials on a generic 3D face model and solving a Perspective-n-Point problem to estimate the 3D pose of the face in the photo. The model is then back-projected to produce a frontal-warped version of each photo. Let the rigid pose-corrected fiducials in each photo be $F^i$ and the target fiducials be $F^T$. Given the two sets of fiducials, we estimate a smooth mapping $r$ that transforms the $i$-th photo to the target expression using a smooth variant of thin-plate splines [23], which minimizes the following objective:


$$\min_r \sum_{i=1}^{n} \left\| F^T - r(F^i) \right\|^2 + \lambda \iint \left( r_{xx}^2 + 2 r_{xy}^2 + r_{yy}^2 \right) dx\, dy \qquad (1)$$


The optimal mapping $r$ satisfying this objective can be represented with a radial basis function $\phi(x) = x^2 \log x$ and efficiently solved with a linear system of equations [23]. Given the optimal $r$, we can then warp each face photo to the target expression by backward warping. However, this warping relies only on a sparse set of fiducials, and the resulting warp field can be too coarse to capture the shape changes required to match the target expression, which results in a blurry average around the eyes and mouth (Figure 6 b). To refine the alignment, we perform an additional dense warping step by exploiting appearance subspaces based on [21, 15]. The idea is to warp all photos to their average, which now has the target expression, through optical flow between illumination-matched pairs. Specifically, let the face image after TPS warping be $P$, and its projection onto the rank-4 appearance subspace of the TPS-warped photos be $\hat{P}$. The refined warp field is then simply the flow from $P$ to $\hat{P}$. In the case where a reference photo is available (Figure 5), we can further warp the entire collection to the reference photo by computing an optical flow that warps $\hat{P}$ to $\hat{P}_{ref}$, denoted by $F_{\hat{P} \to \hat{P}_{ref}}$, and compute the final warp field by composing $F_{P \to \hat{P}} \circ F_{\hat{P} \to \hat{P}_{ref}} \circ F_{\hat{P}_{ref} \to P_{ref}} = F_{P \to \hat{P}} \circ F_{\hat{P} \to \hat{P}_{ref}}$, from the fact that $F_{\hat{P}_{ref} \to P_{ref}}$ is an identity mapping.

Figure 7. a) shows Tom Hanks' average before detail enhancement. b) and c) show the average after single-scale and multi-scale blending.
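
The sparse warping step can be sketched with a textbook regularized thin-plate spline fitted to fiducial correspondences. This is an illustrative sketch under stated assumptions, not the solver of [23]: the function names are hypothetical, the sum is taken over corresponding fiducial points, and the way `lam` enters the linear system (added to the kernel diagonal) is one common convention whose scaling need not match the $\lambda$ reported in the paper.

```python
import numpy as np

def tps_kernel(r):
    # Radial basis phi(r) = r^2 log r, with phi(0) = 0 by continuity.
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(src, dst, lam=10.0):
    """Fit a smoothing thin-plate spline that maps src points (n, 2) toward dst (n, 2)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    K = tps_kernel(np.linalg.norm(src[:, None] - src[None, :], axis=2))
    P = np.hstack([np.ones((n, 1)), src])  # affine part: [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K + lam * np.eye(n)        # lam > 0 trades exact fit for smoothness
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    params = np.linalg.solve(A, b)
    return params[:n], params[n:]          # radial weights, affine coefficients

def apply_tps(points, src, weights, affine):
    """Evaluate the fitted mapping at arbitrary 2D points (m, 2)."""
    points = np.asarray(points, dtype=float)
    src = np.asarray(src, dtype=float)
    U = tps_kernel(np.linalg.norm(points[:, None] - src[None, :], axis=2))
    return U @ weights + np.hstack([np.ones((len(points), 1)), points]) @ affine

# Toy fiducials: the target is a pure translation of the source.
src_pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.3]], dtype=float)
dst_pts = src_pts + np.array([2.0, -1.0])
w, a = fit_tps(src_pts, dst_pts, lam=10.0)
warped = apply_tps(src_pts, src_pts, w, a)
```

Because a translation has zero bending energy, the fitted spline reproduces it exactly even with regularization. For backward warping of an image, the same mapping would be fitted from target fiducials to source fiducials and evaluated on the output pixel grid.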

Adding high-detail. Given the set of aligned face photos, we compute a weighted average of the aligned photos, where the weights measure the expression similarity to the target expression and the confidence of high-frequency details. We measure expression similarity by the $L_2$ norm of the difference between the source and target fiducial points, and high-frequency details by the response of a Laplacian filter. A spatially-varying weight $W^i_{jk}$ for face $i$ at pixel $(j, k)$ is computed as:

$$W^i_{jk} = \exp\left( \frac{-\| F^T - F^i \|^2}{2\sigma^2} \right) \cdot \left( L^i_{jk} \right)^{\alpha} \qquad (2)$$

where $L^i_{jk}$ is the response of a Laplacian filter on face image $i$ at pixel $(j, k)$. An average produced with this weighting scheme exhibits blending artifacts, for example when high-frequency details from many photos with various illuminations are blended together (Figure 7). To avoid this problem, the blending is done in a multi-scale framework, which blends different image frequencies separately. In particular, we construct a Laplacian pyramid for every face photo and compute the weighted average of each level from all the pyramids according to the normalized $W^i_{jk}$, then collapse the average pyramid to create a final texture.
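
The multi-scale blending can be sketched as follows. This is a simplified illustration and makes several assumptions not taken from the paper: the pyramid uses a 2x2 box filter with nearest-neighbour upsampling rather than a Gaussian kernel, image sides must be divisible by 2 at every level, and each photo gets one scalar weight per level instead of a per-pixel weight map. All function names are hypothetical.

```python
import numpy as np

def downsample(img):
    # 2x2 box average; assumes even height and width.
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def upsample(img):
    # Nearest-neighbour 2x upsampling.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels):
    """Return bands ordered from coarsest (index 0) to finest."""
    bands, current = [], np.asarray(img, dtype=float)
    for _ in range(levels - 1):
        smaller = downsample(current)
        bands.append(current - upsample(smaller))  # detail lost by downsampling
        current = smaller
    bands.append(current)                          # low-frequency residual
    return bands[::-1]

def collapse(bands):
    current = bands[0]
    for detail in bands[1:]:
        current = upsample(current) + detail
    return current

def blend(images, weights, levels=3):
    """Weighted average per pyramid level; weights has shape (n_images, levels)."""
    pyramids = [laplacian_pyramid(im, levels) for im in images]
    weights = np.asarray(weights, dtype=float)
    blended = []
    for l in range(levels):
        w = weights[:, l] / weights[:, l].sum()    # normalize across photos
        blended.append(sum(wi * pyr[l] for wi, pyr in zip(w, pyramids)))
    return collapse(blended)
```

Giving two photos equal weight at the coarsest level but taking the finer levels from only one of them yields a texture whose overall brightness is the average of both while its details come from a single photo, which is the behaviour the multi-scale scheme relies on.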

With real photo collections, it is rarely practical to assume that the collection spans every expression under every illumination. One problem is that the final textures for different expressions may be averaged from subsets of photos that have different mean illuminations, which results in an inconsistency in the overall color or illumination of the texture. This change in color, however, is low-frequency and is mitigated in the multi-scale framework by preferring a uniform weight distribution in the lower frequency levels of the pyramid. We achieve this by adding a uniform distribution term, which dominates the distribution in the coarser levels:

$$W^{i,l}_{jk} = \left( \exp\left( \frac{-\| F^T - F^i \|^2}{2\sigma^2} \right) + \tau \beta^{-l} \right) \cdot \left( L^i_{jk} \right)^{\alpha} \qquad (3)$$

where $l \in \{0, \ldots, p-1\}$, $l = 0$ represents the coarsest level of a pyramid with $p$ levels, and $\tau$ and $\beta$ are constants.
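
A small numeric sketch shows how a level-dependent uniform term evens out the weights at coarse levels. It assumes that the uniform term has the form $\tau\beta^{-l}$ and that the similarity term is a Gaussian in the fiducial distance; both are assumptions of this sketch, as are the function names. The default constants follow the values reported in Section 6.

```python
import math

def level_weight(fid_dist, lap_response, level,
                 sigma=10.0, alpha=1.0, beta=20.0, tau=1.0):
    """Weight of one photo at one pixel and pyramid level (level 0 = coarsest)."""
    similarity = math.exp(-fid_dist ** 2 / (2.0 * sigma ** 2))
    uniform = tau * beta ** (-level)  # assumed form; shrinks at finer levels
    return (similarity + uniform) * lap_response ** alpha

def share_of_first(fid_dists, level):
    """Normalized weight of the first photo, with equal Laplacian responses."""
    weights = [level_weight(d, 1.0, level) for d in fid_dists]
    return weights[0] / sum(weights)

# One photo matching the target expression (distance 0), one far from it (distance 30).
coarse_share = share_of_first([0.0, 30.0], level=0)
fine_share = share_of_first([0.0, 30.0], level=2)
```

Under these assumptions the matching photo contributes roughly two thirds of the coarsest level but nearly all of the finest one, so color and shading are pooled across many photos while expression detail comes from the similar ones.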

6. Experiments 

In this section, we describe implementation details, runtime, and our results.

Implementation details. In Section 4, the 3D average models for both driver and puppet are reconstructed using [21], which outputs meshes as depth maps with a face size of around 194 x 244 (width x height) pixels. To find correspondence between the driver and puppet for deformation transfer, we project the 2D average of the puppet onto the rank-4 appearance subspace of the driver, then compute an optical flow using Ce Liu's implementation based on Brox et al. [4] and Bruhn et al. [5] with parameters ($\alpha$, ratio, minWidth, outer-, inner-, SOR-iterations) = (0.02, 0.85, 20, 4, 1, 40). The ROF denoising algorithm used for adjusting the deformation magnitude has only two parameters: the weight constant for the TV regularization, which is set to 1, and the Huber epsilon, which is set to 0.05. In Section 5, $\lambda$ in the TPS warping objective is set to 10, and the dense warping step uses the same optical flow implementation but with $\alpha = 0.3$. For Equation 3, $(\alpha, \beta, \tau) = (1, 20, 1)$; $\sigma$ is typically set to 10 but can vary between 6 and 12 for different sizes and qualities of the photo collections (see Section 7).
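
The rank-4 appearance subspace projection mentioned above can be sketched with a truncated SVD. This is a minimal illustration assuming aligned grayscale images stacked as rows of a matrix and no mean subtraction; the function name is hypothetical and the optical flow step that follows the projection is not shown.

```python
import numpy as np

def project_onto_subspace(images, query, rank=4):
    """Project `query` (h, w) onto the rank-`rank` subspace spanned by `images` (n, h, w)."""
    X = np.asarray(images, dtype=float).reshape(len(images), -1)
    # Rows of vt are orthonormal directions in pixel space, sorted by singular value.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    basis = vt[:rank]
    q = np.asarray(query, dtype=float).reshape(-1)
    return (basis.T @ (basis @ q)).reshape(np.shape(query))

# Toy collection whose images all lie in a 4-dimensional subspace.
rng = np.random.default_rng(1)
modes = rng.random((4, 6, 5))
collection = np.tensordot(rng.random((10, 4)), modes, axes=1)  # (10, 6, 5)
inside = np.tensordot(rng.random(4), modes, axes=1)            # lies in the subspace
outside = rng.random((6, 5))                                   # generic image
projected = project_onto_subspace(collection, outside)
```

An image already in the subspace is returned unchanged, while a generic image is replaced by its closest approximation under the collection's dominant appearance modes, which is what makes the two inputs to the flow computation comparable in illumination.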

Runtime. We test our system on a single CPU core of a quad-core Intel i7-4770 @ 3.40GHz. Computing a deformed puppet mesh based on the driver sequence takes 0.2 seconds per frame, with 0.13 seconds spent on denoising. Synthesizing a texture, which includes TPS warping, dense warping, and multi-scale pyramid blending, takes 0.34 seconds per frame on average.

Evaluating a puppetry system objectively is extremely hard, and there exists no accuracy metric or benchmark to evaluate such a system. Ground-truth shapes for evaluating deformation transfer across two people cannot be captured, as this would require the puppet person, whose shape would be captured, to perform exactly like the driver sequence, which is not possible unless the person is the driver themselves. However, such a self-puppetry setup for evaluating the reconstructed geometry requires no deformation transfer and does not evaluate our system. Evaluating the synthesized textures is also qualitative in nature, as the average texture we generate cannot be compared pixel-wise to the reference. We provide results and input references for qualitative comparisons and point out areas where further improvement can be made.

Figure 8. The first row contains two frames from YouTube videos of Kevin Spacey and George W. Bush used as references for puppets of many celebrities in the following rows.

Figure 9. We show 3 example subjects for 3D shape and texture reconstruction. The input is a set of photos with varying expressions and appearances, and the output is 3D textured shapes in the same expressions as the input.

From Google Images, we gathered around 200 photos for the celebrities and politicians in Figure 8. We generated output puppetry sequences of those people performing various facial expressions driven by YouTube videos of Kevin Spacey and George W. Bush, shown in the top row. These 3D models are generated by warping an average model of each person with 3D optical flow transferred from the driver (top). So, to render these texture-mapped models, we only synthesize textures in their neutral expressions for the average models but use the target expressions to calculate the blending weights. The identities of these puppets are well preserved and remain recognizable even when driven by the same source, and the transformation provides plausible output for puppets with different genders, ethnicities, skin colors, or facial features. Facial details are enhanced and change dynamically according to the reference expressions, for example, the creases around the mouth in the last column. We strongly encourage readers to watch our supplementary videos for these results.

In Figure 5, we show the capability to recreate consistent textures with expressions similar to those of reference photos in the photo collection. In other words, we are able to "regenerate" each photo in the entire collection so that they appear as if the person is performing different expressions within the same video or photographic capture. Note that each reference here is part of the photo collection used in the averaging process. Texture results for references outside the photo collection are in Figure 9. We compare our method with 3 baseline approaches: 1. A single static average is TPS warped to the reference. This approach produces textures that lack realistic changes such as wrinkles and creases, and shapes that only roughly match the reference (e.g., the eyes in column (ii), second row, which appear bigger than in the reference) because the warping can only rely on sparse fiducial points. 2. A weighted average of the photo collection using the same weights as our method. With this approach, creases can be seen, but the overall texture colors appear inconsistent when there is variation in the mean color of different high-weighted sets of photos. The overall textures look blurry as no alignment is done for each photo, and the shapes (eyes in the third row) do not match the reference when the number of similar photos in the collection is small. 3. An improved weighted average with a prewarping step, which includes TPS and dense warping similar to our pipeline. The prewarping step improves the shapes and the sharpness of the faces, but the textures remain inconsistent. Our method in column (v) produces sharp, realistic, and consistent textures with expression-dependent details, and is able to match references with strong illumination or in black-and-white, as shown in Figure 5 (b). Since the references are part of the averaging process, some high-frequency details such as wrinkles are transferred to the output texture. However, low-frequency details such as shading effects, the soft shadow under the nose (in the last example, middle row), or highlights (in the second example, last row) are averaged out in the multi-scale blending and are not part of the final textures.

In Figure 9, we show self-puppetry results where we render output 3D models from [21] with our textures. Similarly to Figure 8, we only synthesize textures in neutral expressions for the average models, with blending weights calculated based on the target expressions. The reference photos are excluded from the photo collection in the averaging process. Our textures remain consistent when the references have different lighting, and look realistic from various angles. In the fourth reference in the last row, our textures have wrinkles but they are less pronounced than in the input reference, which is due partly to the fact that fewer than 5% of the photos in the collection have wrinkles.

7. Discussion 

The quality of the synthesized textures depends heavily on many aspects of the photo collection, including the number and resolution of the photos and the variation in expression and lighting. Since the textures are synthesized based on the assumption that we can find photos with similar expressions, the results will degrade with a smaller photo collection (less expression variation). In that situation, the method needs to take into account less-similar photos with a larger standard deviation in Equation 3, resulting in a less pronounced expression. If the standard deviation is kept small, high-frequency details can flicker when the rendered models from video input are played in sequence. Higher resolution photos directly contribute to a sharper average. Our method is less sensitive to having small light variations than to having small expression variations, because the shading differences are low-frequency and can be shared across a wider range of photos in the coarser levels of the pyramid.

When a photo collection contains on the order of thousands of photos, such as when we extract frames from all movies starring a particular actress, additional characteristics of the photos can be used to fine-tune the similarity measure in the averaging process, such as the directions of lights in the scene to enable a relighting capability, or the age of the person (e.g., from a regressor) to synthesize textures at different ages. Only a small modification is needed to implement these changes in our framework. It would also be useful to learn the association between the appearance of facial details and facial motions, to help with unseen expressions that may share common facial details with existing photos in the collection.

8. Conclusion 

We presented the first system that allows reconstruction of a controllable 3D model of any person from a photo collection, toward the goal of capturing persona. The reconstructed model has time-varying, expression-dependent textures and can be controlled by a video sequence of a different person. This capability opens up the ability to create puppets from any photo collection of a person, without requiring them to be scanned. Furthermore, we believe that the insights from this approach (i.e., using actor B's shape and texture but A's motion) will help drive future research in this area.